ranking task
Socrates-Mol: Self-Oriented Cognitive Reasoning through Autonomous Trial-and-Error with Empirical-Bayesian Screening for Molecules
Wang, Xiangru, Jiang, Zekun, Yang, Heng, Tan, Cheng, Lan, Xingying, Xu, Chunming, Zhou, Tianhang
Molecular property prediction is fundamental to chemical engineering applications such as solvent screening. We present Socrates-Mol, a framework that transforms language models into empirical Bayesian reasoners through context engineering, addressing cold start problems without model fine-tuning. The system implements a reflective-prediction cycle where initial outputs serve as priors, retrieved molecular cases provide evidence, and refined predictions form posteriors, extracting reusable chemical rules from sparse data. We introduce ranking tasks aligned with industrial screening priorities and employ cross-model self-consistency across five language models to reduce variance. Experiments on amine solvent LogP prediction reveal task-dependent patterns: regression achieves 72% MAE reduction and 112% R-squared improvement through self-consistency, while ranking tasks show limited gains due to systematic multi-model biases. The framework reduces deployment costs by over 70% compared to full fine-tuning, providing a scalable solution for molecular property prediction while elucidating the task-adaptive nature of self-consistency mechanisms.
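The prior-evidence-posterior cycle described in the abstract can be illustrated with a minimal sketch. This is an illustrative conjugate-normal update under assumed variances, not the paper's implementation; the function name and variance values are hypothetical:

```python
def posterior_logp(prior_mean, prior_var, evidence_values, evidence_var):
    """Shrink an LLM's initial LogP guess (the prior) toward the mean of
    retrieved analog molecules (the evidence), weighting by precision."""
    n = len(evidence_values)
    evidence_mean = sum(evidence_values) / n
    # Precision-weighted combination (standard conjugate normal update).
    prior_prec = 1.0 / prior_var
    evidence_prec = n / evidence_var
    post_mean = (prior_prec * prior_mean + evidence_prec * evidence_mean) \
                / (prior_prec + evidence_prec)
    post_var = 1.0 / (prior_prec + evidence_prec)
    return post_mean, post_var

# Initial model guess LogP = 1.0 (uncertain); three retrieved analogs
# average 2.0 with tighter variance, so the posterior shifts toward 2.0.
mean, var = posterior_logp(1.0, 1.0, [1.8, 2.0, 2.2], 0.5)
```

The refined prediction then serves as the next cycle's prior, which is how sparse retrieved cases can progressively correct a cold-start estimate.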
R1-Ranker: Teaching LLM Rankers to Reason
Feng, Tao, Hua, Zhigang, Lei, Zijie, Xie, Yan, Yang, Shuang, Long, Bo, You, Jiaxuan
Large language models (LLMs) have recently shown strong reasoning abilities in domains like mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, prime examples of which include retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs' reasoning potential. To address these challenges, we propose R1-Ranker, a reasoning-incentive framework built on reinforcement learning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on nine datasets spanning recommendation, routing, and passage ranking, showing that IRanker-3B consistently achieves state-of-the-art performance, surpasses larger 7B models on some tasks, and yields a 15.7% average relative improvement. Ablation and generalization experiments further confirm the critical role of reinforcement learning and iterative reasoning, with IRanker-3B improving zero-shot performance by over 9% on out-of-domain tasks and reasoning traces boosting other LLMs by up to 22.87%. These results demonstrate that unifying diverse ranking tasks with a single reasoning-driven foundation model is both effective and essential for advancing LLM reasoning in ranking scenarios.
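The iterative elimination idea behind IRanker can be sketched as a loop that repeatedly discards the weakest remaining candidate; the scoring function standing in for the LLM's judgment below is a hypothetical placeholder:

```python
def iterative_elimination_rank(candidates, score):
    """Build a ranking by repeatedly eliminating the lowest-scoring
    remaining candidate; items eliminated later rank higher."""
    remaining = list(candidates)
    eliminated = []
    while remaining:
        worst = min(remaining, key=score)  # the "judge" picks the weakest
        remaining.remove(worst)
        eliminated.append(worst)
    return eliminated[::-1]  # reverse elimination order = final ranking

# Toy judge: score by string length (stand-in for an LLM relevance call).
ranking = iterative_elimination_rank(["a", "abc", "ab"], len)
# ranking == ["abc", "ab", "a"]
```

Decomposing ranking this way yields one decision per step, which is what makes step-wise rewards possible in the RL setup the abstract describes.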
Overview of the TREC 2022 deep learning track
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Campos, Daniel, Lin, Jimmy, Voorhees, Ellen M., Soboroff, Ian
At TREC 2022, we hosted the fourth TREC Deep Learning Track, continuing our focus on benchmarking ad hoc retrieval methods in the large-data regime. As in previous years [Craswell et al., 2020, 2021a, 2022], we leverage the MS MARCO datasets [Bajaj et al., 2016] that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. In addition, last year we refreshed both the passage and the document collections, which led to a nearly 16 times increase in the size of the passage collection and a nearly four times increase in the document collection size. In addition to evaluating ranking methods on the larger collections, the data refresh also aimed at providing additional metadata (e.g., passage-to-document mappings) that may be useful for ranking, as well as incorporating some fixes for known text encoding issues in previous versions of the datasets. This year we continue to benchmark against these larger passage and document collections. However, the significant increase in collection sizes last year led to a corresponding increase in the number of relevant results in the collection per query, and the existing judgment budget was exceeded before a reasonably complete set of these relevant results could be identified by the NIST judges. This large number of relevant results raised serious concerns about the test collection generated by last year's track, relating to reusability and also score saturation [Voorhees et al., 2022, Craswell et al., 2022]. To address these concerns, we made three changes this year with the goal of reducing the number of relevant results per query, and judgment costs in general, so that the judging budget could yield a more complete set of judgments and consequently a more reusable test collection: [1] We used test queries that did not contribute to the MS MARCO corpus.
Overview of the TREC 2023 deep learning track
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Rahmani, Hossein A., Campos, Daniel, Lin, Jimmy, Voorhees, Ellen M., Soboroff, Ian
This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year's design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as the primary task and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out, unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual human-written MS MARCO queries, this year we generated synthetic queries using a fine-tuned T5 model and a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the "nnlm" approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of $\tau = 0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries.
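The system-ordering agreement of τ = 0.8487 quoted above is a Kendall rank correlation. A naive pairwise implementation (illustrative only, not the track's evaluation code; ties are ignored, i.e., tau-a) looks like this:

```python
def kendall_tau(ranks_a, ranks_b):
    """Kendall's tau-a over two equal-length rank lists: concordant
    minus discordant pairs, normalized by the total number of pairs."""
    n = len(ranks_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Two evaluations ordering four systems, with one adjacent pair swapped:
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
# (5 concordant - 1 discordant) / 6 pairs ≈ 0.667
```

A tau of 0.8487 thus means the synthetic-query and human-query evaluations agree on the large majority of pairwise system orderings.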
Overview of the TREC 2021 deep learning track
Craswell, Nick, Mitra, Bhaskar, Yilmaz, Emine, Campos, Daniel, Lin, Jimmy
At TREC 2021, we hosted the third TREC Deep Learning Track, continuing our focus on benchmarking ad hoc retrieval methods in the large-data regime. As in previous years [Craswell et al., 2020, 2021a], we leverage the MS MARCO datasets [Bajaj et al., 2016] that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. In addition, this year we refreshed both the document and the passage collections, which led to a nearly four times increase in the document collection size and a nearly 16 times increase in the size of the passage collection. In addition to evaluating retrieval methods on the larger collections, the data refresh also aimed to provide additional metadata (e.g., passage-to-document mappings) that may be useful for ranking, as well as to incorporate some fixes for known text encoding issues in previous versions of the datasets. This year, in addition to focusing on TREC-style blind evaluation of neural methods against strong traditional baselines, the track also encouraged participating groups to annotate their runs based on whether they employ dense retrieval methods and whether their ranking pipeline is a single-stage retrieval process. The goal was both to encourage more exploration of neural methods in first-stage retrieval and to allow analysis of how these emerging methods compare to the previous state-of-the-art. Deep neural ranking models that employ large-scale pretraining continued to outperform traditional retrieval methods this year. We also found that single-stage retrieval can achieve good performance on both tasks, although it still does not perform on par with multi-stage retrieval pipelines. Finally, the increase in the collection size and the general data refresh raised some questions about the completeness of NIST judgments and the quality of the training labels that were mapped to the new collections from the old ones, which we discuss later in this report.
Immersive Explainability: Visualizing Robot Navigation Decisions through XAI Semantic Scene Projections in Virtual Reality
de Heuvel, Jorge, Müller, Sebastian, Wessels, Marlene, Akhtar, Aftab, Bauckhage, Christian, Bennewitz, Maren
End-to-end robot policies achieve high performance through neural networks trained via reinforcement learning (RL). Yet their black-box nature and abstract reasoning pose challenges for human-robot interaction (HRI), because humans may have difficulty understanding and predicting the robot's navigation decisions, hindering trust development. We present a virtual reality (VR) interface that visualizes explainable AI (XAI) outputs and the robot's lidar perception to support intuitive interpretation of RL-based navigation behavior. By visually highlighting objects based on their attribution scores, the interface grounds abstract policy explanations in the scene context. This XAI visualization bridges the gap between obscure numerical XAI attribution scores and a human-centric semantic level of explanation. A within-subjects study with 24 participants evaluated the effectiveness of our interface for four visualization conditions combining XAI and lidar. Participants ranked scene objects across navigation scenarios based on their importance to the robot, followed by a questionnaire assessing subjective understanding and predictability. Results show that semantic projection of attributions significantly enhances non-expert users' objective understanding and subjective awareness of robot behavior. In addition, lidar visualization further improves perceived predictability, underscoring the value of integrating XAI and sensor visualizations for transparent, trustworthy HRI.
OrdRankBen: A Novel Ranking Benchmark for Ordinal Relevance in NLP
Wang, Yan, Qian, Lingfei, Peng, Xueqing, Huang, Jimin, Feng, Dongji
The evaluation of ranking tasks remains a significant challenge in natural language processing (NLP), particularly due to the lack of direct labels for results in real-world scenarios. Benchmark datasets play a crucial role in providing standardized testbeds that ensure fair comparisons, enhance reproducibility, and enable progress tracking, facilitating rigorous assessment and continuous improvement of ranking models. Existing NLP ranking benchmarks typically use binary relevance labels or continuous relevance scores, neglecting ordinal relevance scores. However, binary labels oversimplify relevance distinctions, while continuous scores lack a clear ordinal structure, making it challenging to capture nuanced ranking differences effectively. To address these challenges, we introduce OrdRankBen, a novel benchmark designed to capture multi-granularity relevance distinctions. Unlike conventional benchmarks, OrdRankBen incorporates structured ordinal labels, enabling more precise ranking evaluations. Given the absence of suitable datasets for ordinal relevance ranking in NLP, we constructed two datasets with distinct ordinal label distributions. We further evaluate models of three types on these datasets: ranking-based language models, general large language models, and ranking-focused large language models. Experimental results show that ordinal relevance modeling provides a more precise evaluation of ranking models, improving their ability to distinguish multi-granularity differences among ranked items, which is crucial for tasks that demand fine-grained relevance differentiation.
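The advantage of ordinal over binary labels can be made concrete with a small DCG example (the standard gain and discount formula, not OrdRankBen's specific metric): two rankings that tie under binary relevance can differ under graded relevance.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard 2^g - 1 gain and
    log2 position discount."""
    return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

# Under binary labels, grades 1..3 all collapse to "relevant", so a
# ranking with the grade-3 item first or second maps to the same gains.
binary_tie = dcg([1, 1, 0])
# Under ordinal labels, ranking the grade-3 item first scores strictly
# higher than ranking it second.
better = dcg([3, 1, 0])  # highly relevant item at rank 1
worse = dcg([1, 3, 0])   # highly relevant item at rank 2
```

This is the "multi-granularity" distinction the abstract refers to: only graded gains let the metric penalize demoting a highly relevant item below a marginally relevant one.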